A common subject of debate about how the media relays information to the public is the relative coverage that particular topics receive in readers' Facebook feeds. We attempt to answer this question objectively using topic modelling techniques.
In order to determine the distribution of topics that news websites cover, we need a way to map the headline, caption, and other raw text to a particular topic, without prior knowledge of what those topics are. This translates to an unsupervised classification problem on natural language. One of the most common algorithms for this task is Latent Dirichlet Allocation (LDA).1
This model views the text generation process as conforming to the following characteristics:

each document is a mixture of topics, with mixture weights drawn from a Dirichlet distribution,
each topic is a distribution over the words in the vocabulary, and
each word in a document is generated by first sampling a topic from the document's topic mixture, then sampling a word from that topic's word distribution.

LDA is the algorithm that simultaneously maps words to topics and topics to documents. One of the important considerations of this model is that the number of topics \(k\) must be known a priori.
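As a small, self-contained illustration of these mechanics (the corpus, counts, and term names below are invented), LDA from the topicmodels package can be fit to a toy document-term matrix:

```r
library(topicmodels)

# Toy document-term matrix: 4 documents over a 5-term vocabulary.
# LDA() accepts a DocumentTermMatrix or any matrix coercible to a
# simple triplet matrix with integer entries.
toy.dtm <- matrix(
  c(5, 4, 0, 0, 1,
    4, 5, 1, 0, 0,
    0, 1, 5, 4, 0,
    0, 0, 4, 5, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("doc", 1:4),
    c("vote", "senate", "storm", "rain", "news")
  )
)

# Fit LDA; note that k must be specified up front
toy.lda <- LDA(toy.dtm, k = 2, control = list(seed = 1))

posterior(toy.lda)$terms   # each topic is a distribution over terms
posterior(toy.lda)$topics  # each document is a mixture of topics
```

Each row of `posterior(toy.lda)$topics` sums to 1, as does each row of `posterior(toy.lda)$terms`.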
See here for an excellent detailed video explanation of the LDA algorithm, here for the seminal paper on the algorithm, and here for the specific implementation in R.
We load libraries needed for this analysis.
suppressPackageStartupMessages({
library(dplyr) # Data manipulation
library(stringr) # String manipulation
library(lubridate) # Date and time manipulation
library(purrr) # Functional programming
library(tidyr) # Reshaping
library(magrittr) # Advanced piping
library(pushoverr) # Pushover notifications
library(doMC) # Parallel Computing
library(readr) # Importing data
library(tibble) # Better data frames
library(ggplot2) # Static data visualization
library(ggrepel) # Repel text labels
library(ggiraph) # GGplot interactive
library(scales) # Scales
library(viridis) # Viridis color scales
library(htmlwidgets) # JS visualizations
library(httr) # HTTP functions
library(jsonlite) # JSON parsing
library(tidytext) # Tidy text mining
library(hunspell) # Text processing
library(stringdist) # String distances
library(topicmodels) # Topic modelling
library(proxy) # Distance measures
})
# Set proper working directory
if (interactive() && !str_detect(getwd(), "src")) setwd("src")
# Theming
quartzFonts(
Roboto =
c("Roboto-Light",
"Roboto-Bold",
"Roboto-Regular",
"Roboto-Thin")
)
theme_set(
theme_bw(base_family = "Roboto", base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 14,
margin = margin(0, 0, 4, 0, "pt")),
plot.subtitle = element_text(size = 12),
plot.caption = element_text(size = 6, hjust = 0),
axis.title = element_text(size = 10),
panel.border = element_blank()
)
)
# Functions
source("../func/01-page-scraping-functions.R")
source("../func/02-topic-modelling-functions.R")From the data that we have extracted via the Facebook Graph API (details can be found here).
# Create linkage to database
src_sqlite(
path = "../dta/03-fbpages.sqlite",
create = FALSE
) -> fbpages.sqlite
# Define required tables
fbpages.sqlite %>% tbl("fact_posts") -> posts.sql
fbpages.sqlite %>% tbl("fact_posts_attachments") -> attachments.sql
# Load data
posts.sql %>%
left_join(attachments.sql, by = c("post_id" = "object_id")) %>%
collect(n = Inf) ->
posts.dt
# Cache data
save(posts.dt, file = "../cache/raw_data.rda")
# Load top 100 fbpages dataset
top_100_fbpages.dt <- readRDS("../dta/04-news-pages.rds")

posts.dt %>%
# Remove duplicates
group_by(post_id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
# Timestamp type and timezone conversion
mutate(
post_timestamp_utc =
post_timestamp_utc %>%
as.POSIXct(origin = "1970-01-01", tz = "UTC"),
post_timestamp_local =
post_timestamp_utc %>%
with_tz("Asia/Manila")
) ->
posts_clean.dt

We require additional fields in order to make the analysis richer. We query the Facebook Graph API for the description, url, and attachment_media_src fields of the attachments.
if (!file.exists("../cache/topic-modeling-posts-augment.rds")) {
# Authenticate
load("../bin/fb_auth.rda")
getAppAccessToken(
fb_app_id = fb_app_id,
fb_app_secret = fb_app_secret
)
# Create augmentation function
getFBAttachmentData <- function(post_ids, partition) {
cat(unique(partition), "\n")
Sys.sleep(0.5)
callFBGraphAPI(
node = "attachments",
query = list(
ids = paste0(post_ids, collapse = ","),
fields = "description,url,media{image{src}}"
)
)
}
# Loop over attachments and prepare data
posts_clean.dt %>%
mutate(partition = ceiling(row_number()/20)) %>%
split(.$partition) %>%
map(~getFBAttachmentData(.$post_id, .$partition)) ->
posts_augment.ls
# Flatten and process data
posts_augment.ls %>%
map(~map(., "data")) %>%
flatten_df("post_id") %>%
mutate(media = map_chr(media, ~.$src %||% NA_character_)) %>%
rename(
attachment_description = description,
attachment_url = url,
attachment_media_src = media
) ->
posts_augment.dt
# Save to cache
saveRDS(posts_augment.dt, "../cache/topic-modeling-posts-augment.rds")
} else {
readRDS("../cache/topic-modeling-posts-augment.rds") ->
posts_augment.dt
}

Now that all data transforms are complete, we consolidate objects in memory.
# Join to original table
posts_clean.dt %>%
left_join(posts_augment.dt, by = "post_id") %>%
select(-attachment_media_url) ->
posts_complete.dt
# Filter to 2016 posts and rename to original file
posts_complete.dt %>%
filter(year(post_timestamp_local) == 2016) ->
posts.dt
# Remove objects in memory
rm(posts_augment.dt, posts_clean.dt, posts_complete.dt, posts_augment.ls)

We use a Tagalog hunspell dictionary created by Jan Alonzo (GNU GPLv2) for the succeeding parts of this analysis.
For this analysis, we include only post types that easily lend themselves to text analysis; video, photo, and other complex post types are removed, and only the share and quoted_share types are retained.
In this dataset, we have three columns of text data on which to perform our analysis:
post_message - the message that appears at the top of the post and is written for each post,
attachment_title - the title of the linked article, usually the headline, and
attachment_description - the summary text appearing below the title, usually an excerpt of the full article linked.

We took a look at possible discrepancies in the data by examining cases where any of these columns are empty; we have isolated them as either data errors or deliberately empty. In either case, they do not warrant special treatment.
We simply construct the raw text field by concatenating the three columns, separated by a space. We then tokenize it by word to transform it into tidy format.
posts.dt %>%
# Filter to text types only
filter(attachment_type %in% c("share", "quoted_share")) %>%
# Construct text field
unite(text, post_message, attachment_title, attachment_description, sep = " ") %>%
# Retain only relevant columns
select(post_id, text) %>%
# Tokenize by word
unnest_tokens(word, text, token = "words", to_lower = TRUE) %>%
unnest_tokens(word, word, token = "regex", pattern = "[.]", collapse = FALSE) %>%
unnest_tokens(word, word, token = "regex", pattern = "[_]", collapse = FALSE) %>%
unnest_tokens(word, word, token = "regex", pattern = "[:]", collapse = FALSE) %>%
# Cleaning apostrophe words
mutate(word = word %>% str_replace("['’‘].*$", "")) ->
# Assign to variable
posts_tokenized.dt

We first try to filter out words with invalid characters and numbers.
# Get unique characters
posts_tokenized.dt %>%
distinct(word) %>%
mutate(word = iconv(word, to = "UTF-8")) %>%
mutate(chars = str_split(word, pattern = "")) %$%
flatten_chr(chars) %>%
unique() ->
unique_characters.vec
unique_characters.vec

As you can see, there are quite a few characters that don't make sense. We remove words that contain invalid characters.
# Get invalid characters
unique_characters.vec[!str_detect(unique_characters.vec, "[a-zá-ž]")] %>%
{ paste0("[", paste0(., collapse = ""), "]") } ->
invalid_characters.rgx
# Remove words with invalid characters
posts_tokenized.dt %>%
filter(!str_detect(word, invalid_characters.rgx)) ->
posts_tokenized.dt

We also remove words that are fewer than 3 characters long.
# Remove short words
posts_tokenized.dt %>%
filter(str_length(word) >= 3) ->
posts_tokenized.dt

Given that this is post information and not comments, we don't have to do much spell checking, but we do so as a precautionary step.
posts_tokenized.dt %>%
# Tagalog and English spell checks
filter(!hunspell_check(word, "tl_PH") & !hunspell_check(word, "en_US")) %>%
distinct(word) ->
posts_misspelled_words.dt
posts_misspelled_words.dt %>% sample_n(1000) %$% word

It seems that most of the words that do not pass the spell check are either not in the dictionary or are proper nouns. We therefore decide that correcting the spelling is not worthwhile, as it would probably distort the true meaning of the words.
Stemming is the process of converting words into their root forms, so that we get to the root idea and can relate similar words to each other. The stemming implemented by hunspell and SnowballC seems too strict, so we implement our own stemming based on a small set of affix patterns (regex).
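stemWords() is defined in the sourced functions file and is not shown here. A hypothetical sketch of such a regex stemmer (the helper name and logic below are ours, not the author's implementation) might strip each affix pattern in turn and accept the candidate stem only when it passes the hunspell dictionary check:

```r
library(dplyr)
library(purrr)
library(stringr)
library(hunspell)

# Hypothetical regex stemmer: try each affix pattern and keep the
# stripped candidate only if it is a valid dictionary word.
stemWordsSketch <- function(words, patterns, dict) {
  tibble(word = unique(words)) %>%
    mutate(stem_word = reduce(
      patterns,
      function(w, p) {
        candidate <- str_remove(w, p)
        if_else(hunspell_check(candidate, dictionary(dict)), candidate, w)
      },
      .init = word
    )) %>%
    # Return only the words that were actually stemmed
    filter(stem_word != word)
}

stemWordsSketch(c("votes", "walked", "running"), c("es$", "s$", "ed$"), "en_US")
```

With these patterns, "walked" stems to "walk", while "running" is left untouched because no pattern applies.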
# Compute stem words
posts_tokenized.dt %>%
stemWords(c("es$", "s$", "ed$", "d$", "ing$", "ly$"), "en_US") ->
posts_stems_en.dt
posts_tokenized.dt %>%
stemWords(c("^nag", "^na", "um", "in", "an$", "in$"), "tl_PH") ->
posts_stems_tl.dt
# Show some examples
posts_stems_en.dt %>% sample_n(20)
posts_stems_tl.dt %>% sample_n(20)

# Check no overlap between Tagalog and English stem words
stopifnot(
intersect(posts_stems_tl.dt$word, posts_stems_en.dt$word) %>% length() == 0
)
# Combine
posts_stems.dt <-
rbind(
posts_stems_en.dt,
posts_stems_tl.dt
)
# Add to tokenized posts
posts_tokenized.dt %>%
left_join(posts_stems.dt, by = "word") %>%
mutate(word = ifelse(is.na(stem_word), word, stem_word)) %>%
select(-stem_word) ->
posts_tokenized.dt
# Clean up
rm(posts_stems.dt, posts_stems_en.dt, posts_stems_tl.dt)

We next eliminate stopwords, i.e. words that are common in a language but carry little meaning. For the purpose of this model, we use stopwords for both English and Filipino/Tagalog.
# Collect English and Tagalog stopwords
stopwords.ls <- c(
stop_words$word,
fromJSON(
"https://raw.githubusercontent.com/stopwords-iso/stopwords-tl/master/stopwords-tl.json"
),
# Manual stopwords
"thy", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
"befullyinformed", "sen", "cnn", "final", "start", "san", "anti", "http", "https",
"iii"
)
# Filter out stopwords
posts_tokenized.dt %>%
filter(!word %in% stopwords.ls) ->
posts_tokenized.dt

We summarise the data into a more compressed form and convert it into a document-term matrix.
# Summarise
posts_tokenized.dt %>%
group_by(post_id, word) %>%
summarise(term_frequency = n()) %>%
ungroup() ->
posts_tokenized.dt

We remove words that do not appear in at least 5 documents (3 posts are totally eliminated).
# Select rare words
posts_tokenized.dt %>%
group_by(word) %>%
summarise(count = n()) %>%
ungroup() %>%
filter(count < 5) ->
posts_rare_words.dt
# Remove rare words
posts_tokenized.dt %>%
anti_join(posts_rare_words.dt, by = "word") ->
posts_tokenized.dt
# Clean up
rm(posts_rare_words.dt)

# Convert to document term matrix
posts_tokenized.dt %>%
cast_dtm(post_id, word, term_frequency) ->
posts_tokenized.dtm

In order to determine the ideal number of topics \(k\), we perform 5-fold cross-validation on perplexity at different values of \(k\). We then compute the rate of perplexity change (RPC) and use the elbow point as the ideal number of topics. This method is detailed in Zhao et al. (2015)2.

On a 10% random sample, we estimate the perplexity using 5-fold cross-validation for different values of \(k\).
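crossValidatePerplexity() and plotCrossValidatedPerplexity() come from the sourced functions file. The RPC computation itself is simple; a sketch with invented perplexity values:

```r
library(dplyr)

# RPC(i) = |P(k_i) - P(k_{i-1})| / (k_i - k_{i-1}), per Zhao et al. (2015).
# The perplexity values below are invented for illustration only.
rpc.dt <- tibble(
  k          = c(10, 15, 20, 25, 30, 35, 40, 45, 50),
  perplexity = c(3200, 2950, 2780, 2660, 2570, 2505, 2460, 2440, 2430)
) %>%
  mutate(rpc = abs(perplexity - lag(perplexity)) / (k - lag(k)))

rpc.dt
# The elbow is where rpc stops falling appreciably with additional topics
```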
set.seed(7292)
crossValidatePerplexity(
dtm = posts_tokenized.dtm[
sample(1:posts_tokenized.dtm$nrow, posts_tokenized.dtm$nrow * 0.1),],
n_folds = 5,
k_candidates = c(10, 15, 20, 25, 30, 35, 40, 45, 50),
method = "Gibbs"
) ->
posts_tokenized_perplexity_cv.dt

plotCrossValidatedPerplexity(posts_tokenized_perplexity_cv.dt) +
annotate("segment", x = 40, xend = 40, y = -Inf, yend = Inf, color = "red") +
labs(
title = "RATE OF CROSS-VALIDATED PERPLEXITY CHANGE",
subtitle =
"Change in LDA Perplexity, 5-fold cross validation, 2016 Facebook News Corpus",
x = "Number of Topics (k)",
y = "Rate of Perplexity Change\n(units/marginal topic)",
caption = "\nLine indicates average perplexity change\nPoints indicate folds"
)

Note: this took 8 hours to run in parallel on 3 cores.
As you can see, the change point (where the rate of perplexity change no longer falls significantly with additional topics) is at 40 topics.
We then train the LDA on the full dataset, with \(k = 40\).
# Train the LDA model
LDA(
x = posts_tokenized.dtm,
k = 40,
method = "Gibbs",
control = list(seed = 7292)
) -> posts.lda

Training took 2 hours on the full dataset. We extract the word-topic probabilities and the document-topic probabilities produced by the model.
# Word-topic probabilities
posts.lda %>% tidy("beta") -> posts.wtp
# Document-topic probabilities
posts.lda %>% tidy("gamma") -> posts.dtpWe can then assess the word-topic probabilities in order to get an idea of the topic that is most.
Since we have over 30,000 unique terms in the corpus, we need to extract the top few words that most uniquely define each topic, so that we can more easily visualize and label them. We use the measure of relevance defined by Sievert and Shirley (2014)3. Relevance of term \(w\) to topic \(k\) given a weight parameter \(\lambda\) between 0 and 1 \(r(w, k | \lambda)\) is computed as:
\[ r(w, k | \lambda) = \lambda\log(\phi_{kw}) + (1-\lambda)\log(\frac{\phi_{kw}}{p_w}) \]
where \(\phi_{kw}\) is the probability of term \(w\) for topic \(k\), and \(p_w\) is the empirical probability of the word in the corpus. \(\lambda\) can be thought of as a weighting term between ranking by the probability of that word within the topic, and ranking by the lift over the overall probability in the corpus.
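For instance, with illustrative values \(\phi_{kw} = 0.02\), \(p_w = 0.001\), and \(\lambda = 0.6\), the relevance works out to about \(-1.15\):

```r
# Worked example with illustrative values: a term with probability 0.02
# within the topic and overall corpus probability 0.001
lambda <- 0.6
phi_kw <- 0.02
p_w    <- 0.001
relevance <- lambda * log(phi_kw) + (1 - lambda) * log(phi_kw / p_w)
relevance  # approximately -1.15
```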
In user studies, Sievert and Shirley (2014) found a lambda value of 0.6 to be optimal for allowing humans to identify the topics associated with the top words ranked by relevance. We use this same value for lambda.
posts.wtp %>%
# Compute lambda and phi_kw
mutate(lambda = 0.6, phi_kw = beta) %>%
# Compute and join the p_w
left_join(
posts_tokenized.dt %>%
group_by(word) %>%
summarise(frequency = sum(term_frequency)) %>%
ungroup() %>%
mutate(p_w = frequency/sum(frequency)) %>%
select(-frequency),
by = c("term" = "word")
) %>%
# Compute the relevance
mutate(relevance = lambda * log(phi_kw) + (1 - lambda) * log(phi_kw/p_w)) ->
word_relevance.dt

We take a look at sample articles classified under each topic, together with its most relevant words, and label each topic according to the prevalent theme. These labels are recorded in a csv file and loaded in:
# Read in topics mapping data
topics_mapping.dt <- read_csv("../dta/lda-topics.csv", col_types = "icc")

We extract the top 30 most relevant words and plot them as follows:
posts.wtp %>%
# Compute Jensen-Shannon
computeJensenShannonPCA() %>% {
# Assign positions to top 30 most relevant words
left_join(
.,
word_relevance.dt %>%
group_by(topic) %>%
top_n(30, relevance) %>%
ungroup(),
by = "topic"
) -> topic_labelling_top30words.dt
# Assign positions to topic labels
left_join(
.,
topics_mapping.dt,
by = "topic"
) -> topic_labelling_topics.dt
ggsave(
plot = {
topic_labelling_top30words.dt %>%
ggplot(aes(x = x, y = y)) +
geom_text_repel(
aes(label = term,
color = factor(topic, levels = 1:40),
size = phi_kw * p_w),
lineheight = 0.6,
segment.size = NA,
box.padding = unit(0, "lines"),
force = 0.05,
family = "Roboto"
) +
geom_text_repel(
data = topic_labelling_topics.dt,
aes(label = str_wrap(topic_name, 15),
color = factor(topic, levels = 1:40)),
size = 2.25, lineheight = 0.7,
segment.size = 0.5, alpha = 0.7,
family = "Roboto", fontface = 2,
force = 0.3
) +
scale_size_continuous(range = c(1, 4)) +
scale_color_manual(
values = rainbow(40) %>%
adjustcolor(red.f = 0.6, green.f = 0.6, blue.f = 0.6)
) +
theme(
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.background = element_rect(fill = "#FAFAFA"),
legend.position = "none"
) +
labs(
title = "PHILIPPINE NEWS LANDSCAPE",
subtitle = "Headlines and news summaries from Philippine news Facebook pages in the year 2016",
caption =
"DATA SOURCE: Facebook Graph API
CHART NOTES:
1. Topics were derived through Latent Dirichlet Allocation (LDA) with 40 topics
2. For each topic, the top 30 words in terms of relevance as defined in Sievert and Shirley (2014) are displayed
3. The centers of each topic mass were determined by applying principal components analysis to the distance matrix, with the Jensen-Shannon divergence as the distance measure.
This layout is intended to display the degree of semantic distance.
TJPALANCA.COM | TROY JAMES PALANCA"
)
},
filename = "../figs/01-news-landscape-map.png",
device = "png",
dpi = 600
)
}

Facebook News Map
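The topic positions in the map come from computeJensenShannonPCA(), which is defined in the sourced functions file. A rough sketch of the idea, under our own assumptions (the helper below is hypothetical): compute the pairwise Jensen-Shannon divergence between topic-word distributions, then project the distance matrix to two dimensions with classical scaling, which is closely related to PCA on a distance matrix:

```r
library(dplyr)
library(tidyr)

# Hypothetical sketch: pairwise Jensen-Shannon divergence between
# topic-word distributions, projected to 2-D via classical scaling.
jensenShannonSketch <- function(wtp) {
  # wtp: tidy word-topic probabilities with columns topic, term, beta
  beta.mat <- wtp %>%
    pivot_wider(names_from = term, values_from = beta, values_fill = 0) %>%
    arrange(topic) %>%
    select(-topic) %>%
    as.matrix()
  js <- function(p, q) {
    m <- (p + q) / 2
    kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
    (kl(p, m) + kl(q, m)) / 2
  }
  n <- nrow(beta.mat)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n))
    d[i, j] <- js(beta.mat[i, ], beta.mat[j, ])
  xy <- cmdscale(as.dist(d), k = 2)
  tibble(topic = seq_len(n), x = xy[, 1], y = xy[, 2])
}
```

Topics with similar word distributions end up close together, which is what gives the map its semantic layout.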
Now that we have labelled each topic, we produce sample documents classified to that topic for reference. We produce a random sample of 10 posts from the 300 highest probability fits per topic.
set.seed(9272)
posts.dtp %>%
group_by(post_id = document) %>%
summarise(topic = min(topic[gamma == max(gamma)]), gamma = max(gamma)) %>%
ungroup() %>%
inner_join(posts.dt %>% select(post_id, attachment_title), by = "post_id") %>%
inner_join(topics_mapping.dt, by = "topic") %>%
mutate(topic_title =
paste0("Topic ", formatC(topic, flag = "0", width = 2),
" - ", topic_name)) %>%
group_by(topic_name) %>%
top_n(300, gamma) %>%
sample_n(10) %>%
mutate(row = row_number()) %>%
ungroup() ->
posts_classification.sdt

We produce a chart that shows this in a presentable manner.
ggsave(
plot = {
posts_classification.sdt %>%
ggplot(aes(x = 0, y = row, color = factor(topic, levels = 1:40))) +
facet_wrap(~topic_title, ncol = 5) +
geom_text(
aes(label = attachment_title),
size = 2, hjust = 0, family = "Roboto"
) +
geom_point(aes(x = -0.025), size = 0.5) +
scale_x_continuous(limits = c(-0.05, 1), expand = c(0, 0)) +
scale_color_manual(
values = rainbow(40) %>%
adjustcolor(red.f = 0.6, green.f = 0.6, blue.f = 0.6)
) +
theme(
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
strip.background = element_blank(),
panel.border = element_blank(),
strip.text = element_text(face = "bold", hjust = 0, size = 6),
plot.background = element_rect(fill = "#FAFAFA"),
legend.position = "none"
) +
labs(
title = "PHILIPPINE NEWS LANDSCAPE",
subtitle = "Sample news articles from Facebook in the year 2016",
caption =
"DATA SOURCE: Facebook Graph API
CHART NOTES:
1. Topics were derived through Latent Dirichlet Allocation (LDA) with 40 topics
2. Documents were classified according to the topic that has the highest probability
TJPALANCA.COM | TROY JAMES PALANCA"
)
},
device = "png",
filename = "../figs/02-topic-classification-sample.png",
dpi = 600, height = 10, width = 16
)